Defining the objective function and all the needed derivatives

The function derivatives in Python and Markdown/LaTeX

Setting up the function and the needed derivatives
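As a concrete sketch of the objective, here is the Goldstein-Price function in NumPy; the exact definition in this notebook's code may differ, but this is the standard form of the function, whose global minimum is $f(0,-1)=3$:

```python
import numpy as np

def goldstein_price(x, y):
    """Standard Goldstein-Price test function; global minimum is f(0, -1) = 3."""
    a = 1 + (x + y + 1) ** 2 * (19 - 14 * x + 3 * x**2 - 14 * y + 6 * x * y + 3 * y**2)
    b = 30 + (2 * x - 3 * y) ** 2 * (18 - 32 * x + 12 * x**2 + 48 * y - 36 * x * y + 27 * y**2)
    return a * b
```

Written with NumPy operations so it also accepts arrays, which is convenient for plotting over a grid.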

The Plotting

We wrap the plotting code in a function for reusability.

Define the Jacobian and Hessian of the function

Let the Jacobian of f be

Jacob(x,y): $\begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix}$

Let the Hessian of f be

Hess(x,y): $\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y}\\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$
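When the analytic partial derivatives are tedious to write out, they can be approximated numerically. Below is a minimal finite-difference sketch of Jacob(x, y) and Hess(x, y); the function names and step sizes are my own choices, not the notebook's:

```python
import numpy as np

def jacobian(f, x, y, h=1e-5):
    # Central finite differences for [df/dx, df/dy].
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

def hessian(f, x, y, h=1e-4):
    # Second-order central differences for the symmetric 2x2 Hessian.
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])
```

Analytic derivatives are preferred when available; finite differences trade accuracy for convenience.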

Extra function to help with Newton's Method

Sample Generation

Using np.random.uniform we draw a 2-dimensional sample; we do this 10000 times and store the generated samples in a list, giving us 10000 samples from the region (-2,2)×(-2,2).
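A minimal sketch of that sampling step; the seed is my own addition for reproducibility:

```python
import numpy as np

np.random.seed(0)  # my addition, for reproducibility
# Draw 10000 two-dimensional samples uniformly from (-2, 2) x (-2, 2)
# and collect them in a list.
samples = [np.random.uniform(-2, 2, size=2) for _ in range(10000)]
```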

Random Search

Random search works by generating points in the ROI and looping through them, keeping whichever one yields the minimum function value as our best sample/choice.
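A sketch of that loop, assuming a list of pre-generated samples and any objective f(x, y); the names are mine, not the notebook's:

```python
def random_search(f, samples):
    # Evaluate f at every pre-generated sample and keep the best one seen.
    best_point, best_val = samples[0], f(*samples[0])
    for p in samples[1:]:
        val = f(*p)
        if val < best_val:
            best_point, best_val = p, val
    return best_point, best_val
```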

Notes:

Performance with the Goldstein-Price function:

Since the algorithm uses pre-defined sample points that are never updated (unlike gradient descent and Newton's method), the function's features, such as the many local minima around the global minimum and the unevenness of its surface, do not interfere with the algorithm's run.

Therefore, in theory, all we need to get a good guess of the global minimum is a large enough number of samples and a good guess of the region that encloses it. Since we were given the region (-2,2)×(-2,2), we get "decent" results, but the limit here is the $10^{3}$ samples, which is not that large a number for this use.

A run with $10^{7}$ samples still does not get us closer to the minimum than the $10^{3}$-sample run. Therefore, in practice, random search is not that great for optimization and has limited uses.

Gradient Descent

In gradient descent we use $grad(f(x))$ to get the vector that points in the direction of greatest increase at a point X; we then move our current point in the opposite direction.

When using GD we do not apply the full change at once; instead we multiply the gradient by a factor, which I call the learning rate here due to my experience with GD in deep learning.

While the GD method works in any dimension d, it is not guaranteed to converge: it might get stuck in a local minimum, or the gradient might grow very large, which in deep learning we call gradient explosion.
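The update loop described above can be sketched as follows, assuming a callable grad that returns the gradient at a point; the parameter names and default values are my own choices:

```python
import numpy as np

def gradient_descent(grad, p0, lr=1e-4, eps=1e-8, max_iter=100_000):
    # Step opposite the gradient, scaled by the learning rate lr, until the
    # gradient norm drops below eps or the iteration timeout is reached.
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad(p)
        if np.linalg.norm(g) < eps:
            break
        p = p - lr * g
    return p
```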

Notes:

Gradient descent is a well-known optimization algorithm in machine learning (ML), since it scales well with dimensionality and the first derivatives of a function are not that hard or expensive to compute.

But as we can see from the code checks, gradient descent is prone to several problems, most notably the gradient growing toward infinity; the algorithm is also not guaranteed to converge:

In ML, gradient explosion is dealt with using gradient clipping, where the value of grad(f) is capped at a maximum, e.g. $\min(val_{max}, grad(f))$. And for the gradient settling in local minima, "momentum" is introduced. Momentum is basically an adjustment of the update as the gradient descent algorithm moves along, giving it a better chance to "skip" over local minima. One optimizer I have used with such a feature is the well-known Adam optimizer.
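A sketch of norm-based gradient clipping as commonly used in ML; the threshold value here is arbitrary:

```python
import numpy as np

def clip_gradient(g, max_norm=10.0):
    # Rescale the gradient so its Euclidean norm never exceeds max_norm;
    # the direction is preserved, only the magnitude is capped.
    norm = np.linalg.norm(g)
    if norm > max_norm:
        return g * (max_norm / norm)
    return g
```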

Here we encounter another possible problem: as we can observe from the figures above, choosing a good stopping condition $\epsilon$ and learning rate $lr$ greatly affects the algorithm's ability to converge. Gradient descent with $lr = 10^{-3}$ at $P_{initial}$ might diverge or even explode within 1 or 2 iterations, but if we use $lr = 10^{-4}$ at the same point $P_{initial}$, it converges with no problem.

Lastly, gradient descent does not completely converge per se, but gets very close to the minimum and settles near it. The problem is that the gradient might also get close, but not close enough to trigger the stopping condition, so the algorithm gets stuck bouncing around forever unless stopped. In the code I added an iteration timeout; in ML, model training with gradient descent is done in rounds called epochs, with extra optimizations like the previously mentioned momentum, and also with early-stopping checks.

Newton's Method

Newton's Method for multi-dimensional functions is the generalization of the well-known Newton's Method to d dimensions, where d > 1.

The standard equation for NM is $x_{i+1} = x_i - \frac{f}{f'}$; for optimization, instead of $\frac{f}{f'}$ we use the first and second derivatives of f, $\frac{f'}{f''}$. But since we are using a two-variable function, the derivative of $f(x,y)$ is actually a vector/matrix of the partial derivatives of the function with respect to every variable. For example, $f'(x,y) = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix}$ and $f''(x,y) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y}\\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$.

But substituting $f'(x,y)$ and $f''(x,y)$ back into $x_{i+1} = x_i - \frac{f'}{f''}$, we hit a problem: division by a matrix is undefined. However, we have what is called 'inversion'.

Rewriting NM with matrix inversion, we get $x_{i+1} = x_i - (f'')^{-1} f'$, where $(f'')^{-1}$ is the inverse of $f''$.

Here $f'$ is called the Jacobian of $f$ and $f''$ is known as the Hessian of $f$.
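One update of this rule can be sketched as below, assuming callables for the Jacobian and Hessian (the function names are mine). Using np.linalg.solve instead of forming the inverse explicitly is numerically safer but equivalent to $x_{i+1} = x_i - (f'')^{-1} f'$:

```python
import numpy as np

def newton_step(jac, hess, p):
    # x_{i+1} = x_i - H(x_i)^{-1} J(x_i), computed via a linear solve H d = J
    # rather than an explicit matrix inverse.
    d = np.linalg.solve(hess(p), jac(p))
    return p - d
```

On a quadratic function a single Newton step lands exactly on the stationary point, which is why NM converges so much faster than gradient descent near the minimum.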

Saved points that work: